Metrics for Mining Multisets

Authors

  • Walter A. Kosters
  • Jeroen F. J. Laros
Abstract

A multiset (also referred to as a bag) is a set (a collection of elements in which order is of no importance) whose elements need not be unique. For example, a vase with n blue and m red marbles is a multiset. We propose a new class of distance measures (metrics) designed for multisets; both multisets and distance measures are recurrent themes in many data mining [2] applications. One particular instance of this class originated from the need to cluster criminal behaviours, where the multisets are the crimes committed in one year. This metric generalises well-known distance measures such as the Jaccard and Canberra distances. The measures are parameterised by a function f; provided f satisfies a few simple restrictions, the result is always a valid metric. This flexibility allows the measures to be tailored to many domain-specific applications. The metrics in this class can be calculated efficiently. In the full paper, all proofs are given and various applications are shown.
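To make the idea of a distance parameterised by a function f concrete, here is a minimal Python sketch. It is not the definition from the full paper, which the abstract does not state. It assumes one plausible form: apply f to the two counts of each distinct element, sum the results, and divide by the number of distinct elements that occur in either multiset. The names `multiset_distance` and `canberra_term` are hypothetical. With the Canberra-style term |x - y| / (x + y) used here, ordinary sets (all counts 0 or 1) give the Jaccard distance.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable


def canberra_term(x: int, y: int) -> float:
    """Per-element term |x - y| / (x + y); defined as 0 when the counts are equal."""
    return 0.0 if x == y else abs(x - y) / (x + y)


def multiset_distance(
    a: Iterable[Hashable],
    b: Iterable[Hashable],
    f: Callable[[int, int], float] = canberra_term,
) -> float:
    """Illustrative distance between two multisets (an assumed form, not the paper's).

    Sums f over the counts of every distinct element and normalises by the
    number of distinct elements occurring in either multiset.
    """
    counts_a, counts_b = Counter(a), Counter(b)
    support = set(counts_a) | set(counts_b)
    if not support:
        return 0.0  # two empty multisets are taken to be identical
    # Counter returns 0 for elements that are absent from one of the multisets.
    return sum(f(counts_a[k], counts_b[k]) for k in support) / len(support)


# The vase example from the abstract: counts of blue and red marbles.
vase_1 = ["blue"] * 3 + ["red"]
vase_2 = ["blue", "red"]
print(multiset_distance(vase_1, vase_2))

# On ordinary sets the same function gives |symmetric difference| / |union|.
print(multiset_distance({"a", "b", "c"}, {"b", "c", "d"}))
```

Passing a different `f` changes how count differences are weighted; whether the result is still a metric depends on the restrictions on f that the full paper spells out.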

Similar resources

General definitions for the union and intersection of ordered fuzzy multisets

Since its original formulation, the theory of fuzzy sets has spawned a number of extensions where the role of membership values in the real unit interval $[0, 1]$ is handed over to more complex mathematical entities. Amongst the many existing extensions, two similar ones, the fuzzy multisets and the hesitant fuzzy sets, rely on collections of several distinct values to represent fuzzy membershi...

Structural Classification of XML Documents Using Multisets

In this paper, we investigate the problem of clustering XML documents based on their structure. We represent the paths in an XML document as a multiset and use the symmetric difference operation on multisets to define certain metrics. These metrics are then used to obtain a measure of similarity between any two documents in a collection. Our technique was successfully applied to real and synthe...

Learning Rules from Very Large Databases Using Rough Multisets

This paper presents a mechanism called LERS-M for learning production rules from very large databases. It can be implemented using object-relational database systems, it can be used for distributed data mining, and its structure is well suited to parallel processing. LERS-M is based on rough multisets and is formulated using relational operations with the objective to be tightly cou...

ON GENERALIZED FUZZY MULTISETS AND THEIR USE IN COMPUTATION

An orthogonal approach to the fuzzification of both multisets and hybrid sets is presented. In particular, we introduce $L$-multi-fuzzy and $L$-fuzzy hybrid sets, which are general enough and in spirit with the basic concepts of fuzzy set theory. In addition, we study the properties of these structures. Also, the usefulness of these structures is examined in the framework of mechanical multiset proc...

Publication date: 2007